This notebook is one in a series of many, in which we explore how different data analysis strategies affect the outcome of a proteomics experiment based on isobaric labeling and mass spectrometry. Each analysis strategy or ‘workflow’ can be divided into different components; we recommend you read more about them in the introduction notebook.

In this notebook specifically, we investigate the effect of varying the Normalization component on the outcome of the differential expression results. The five component variants are: median sweeping, CONSTANd, NOMAD, quantile1 (first on PSM level), quantile2 (only on protein level).

The R packages and helper scripts necessary to run this notebook are listed in the next code chunk: click the ‘Code’ button. Each code section can be expanded in a similar fashion. You can also download the entire notebook source code.

library(stringi)
library(gridExtra)
library(dendextend)
library(kableExtra)
library(psych)
library(limma)
library(tidyverse)
library(preprocessCore)
library(CONSTANd)  # install from source: https://github.com/PDiracDelta/CONSTANd/
library(NOMAD)  # devtools::install_github("carlmurie/NOMAD")
source('util/other_functions.R')
source('util/plotting_functions.R')

Let’s load our PSM-level data set:

data.list <- readRDS(params$input_data_p)
dat.l <- data.list$dat.l # data in long format
display_dataframe_head(dat.l)
Mixture TechRepMixture Condition BioReplicate Run Channel Protein Peptide RT Charge PTM intensity TotalIntensity Ions.Score DeltaMZ isoInterOk noNAs onehit.protein shared.peptide
Mixture2 1 1 Mixture2_1 Mixture2_1 127C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 44048.41 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 0.5 Mixture2_0.5 Mixture2_1 127N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 38298.89 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 0.125 Mixture2_0.125 Mixture2_1 128C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 39552.81 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 0.667 Mixture2_0.667 Mixture2_1 128N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 56730.70 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 0.667 Mixture2_0.667 Mixture2_1 129C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 48744.49 394657.6 65 0.00052 TRUE TRUE FALSE FALSE
Mixture2 1 1 Mixture2_1 Mixture2_1 129N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 60179.96 394657.6 65 0.00052 TRUE TRUE FALSE FALSE

After the filtering done in data_prep.R, there are 19 UPS1 proteins remaining, even though 48 were originally spiked in.

# which proteins were spiked in?
spiked.proteins <- dat.l %>% distinct(Protein) %>% filter(stri_detect(Protein, fixed='ups')) %>% pull %>% as.character
remove_factors(spiked.proteins)
##  [1] "P02787ups"   "P02788ups"   "P01133ups"   "P69905ups"   "P01008ups"  
##  [6] "P02741ups"   "P01375ups"   "P10636-8ups" "O00762ups"   "P62988ups"  
## [11] "P10145ups"   "P05413ups"   "P00915ups"   "P02753ups"   "P02144ups"  
## [16] "P01344ups"   "P01579ups"   "P00709ups"   "P01127ups"
tmp <- dat.l %>% distinct(Protein) %>% pull %>% as.character
# protein subsampling
if (params$subsample_p > 0 & params$subsample_p == floor(params$subsample_p) & params$subsample_p <= length(tmp)){
  sub.prot <- tmp[sample(seq_along(tmp), size=params$subsample_p)]
  if (length(spiked.proteins) > 0) sub.prot <- unique(c(sub.prot, spiked.proteins))
  dat.l <- dat.l %>% filter(Protein %in% sub.prot)
}

We store the metadata in sample.info and show some entries below. We also pick technical replicates with a dilution factor of 0.5 as the reference condition of interest. Each condition is represented by two of eight reporter Channels in each Run.

display_dataframe_head(sample.info)
Run Run.short Channel Sample Sample.short Condition Color
Mixture1_1 Mix1_1 127C Mixture1_1:127C Mix1_1:127C 0.125 black
Mixture1_1 Mix1_1 127N Mixture1_1:127N Mix1_1:127N 0.667 green
Mixture1_1 Mix1_1 128C Mixture1_1:128C Mix1_1:128C 1 red
Mixture1_1 Mix1_1 128N Mixture1_1:128N Mix1_1:128N 0.5 blue
Mixture1_1 Mix1_1 129C Mixture1_1:129C Mix1_1:129C 0.5 blue
Mixture1_1 Mix1_1 129N Mixture1_1:129N Mix1_1:129N 0.125 black
referenceCondition
## [1] "0.5"
channelNames
## [1] "127C" "127N" "128C" "128N" "129C" "129N" "130C" "130N"

1 Unit scale component: log2 transformation of reporter ion intensities

We use the default unit scale: the log2-transformed reporter ion intensities.

dat.unit.l <- dat.l %>% mutate(response=log2(intensity)) %>% select(-intensity)
display_dataframe_head(dat.unit.l)
Mixture TechRepMixture Condition BioReplicate Run Channel Protein Peptide RT Charge PTM TotalIntensity Ions.Score DeltaMZ isoInterOk noNAs onehit.protein shared.peptide response
Mixture2 1 1 Mixture2_1 Mixture2_1 127C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 394657.6 65 0.00052 TRUE TRUE FALSE FALSE 15.42680
Mixture2 1 0.5 Mixture2_0.5 Mixture2_1 127N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 394657.6 65 0.00052 TRUE TRUE FALSE FALSE 15.22502
Mixture2 1 0.125 Mixture2_0.125 Mixture2_1 128C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 394657.6 65 0.00052 TRUE TRUE FALSE FALSE 15.27149
Mixture2 1 0.667 Mixture2_0.667 Mixture2_1 128N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 394657.6 65 0.00052 TRUE TRUE FALSE FALSE 15.79184
Mixture2 1 0.667 Mixture2_0.667 Mixture2_1 129C P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 394657.6 65 0.00052 TRUE TRUE FALSE FALSE 15.57295
Mixture2 1 1 Mixture2_1 Mixture2_1 129N P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 394657.6 65 0.00052 TRUE TRUE FALSE FALSE 15.87700

2 Normalization component

In the next four subsections, let’s look at our different ways to normalize the data on the PSM level.

Since these methods all need to be applied to matrix-like data, let’s switch to wide format. (Actually, this is semi-wide, since the Channel columns still contain contributions from all Runs, but that’s OK because in the next step we split by Run.)

# switch to wide format
dat.unit.w <- pivot_wider(data = dat.unit.l, id_cols=-one_of(c('Condition', 'BioReplicate')), names_from=Channel, values_from=response)
display_dataframe_head(dat.unit.w)
Mixture TechRepMixture Run Protein Peptide RT Charge PTM TotalIntensity Ions.Score DeltaMZ isoInterOk noNAs onehit.protein shared.peptide 127C 127N 128C 128N 129C 129N 130C 130N
Mixture2 1 Mixture2_1 P21291 [K].gFGFGQGAGALVHSE.[-] 144.9444 2 N-Term(TMT6plex) 394657.56 65 0.00052 TRUE TRUE FALSE FALSE 15.42680 15.22502 15.27149 15.79184 15.57295 15.87700 15.71883 15.69836
Mixture2 1 Mixture2_1 P30511 [K].wAAVVVPPGEEQR.[Y] 137.1704 2 N-Term(TMT6plex) 180359.11 27 -0.00653 TRUE TRUE TRUE FALSE 14.36513 14.14964 14.27690 14.65920 14.44955 14.58652 14.54510 14.57900
Mixture2 1 Mixture2_1 P02787ups [K].sASDLTWDNLk.[G] 148.8299 3 N-Term(TMT6plex); K11(TMT6plex) 507218.12 35 -0.00005 TRUE TRUE FALSE FALSE 16.31046 15.32833 15.11530 16.27180 15.92615 16.73355 15.74796 15.42268
Mixture2 1 Mixture2_1 Q9UBQ5 [K].iDFDSVSSIMASSQ.[-] 201.8727 2 N-Term(TMT6plex) 120944.32 72 0.00119 TRUE TRUE FALSE FALSE 13.51956 13.51995 13.66172 14.36018 13.87050 14.20738 13.86685 13.83673
Mixture2 1 Mixture2_1 Q5H9R7 [R].tGQPSAPGDTSVNGPV.[-] 106.9002 2 N-Term(TMT6plex) 99098.46 23 -0.00133 TRUE TRUE FALSE FALSE 13.54513 13.17495 13.34133 13.81323 13.58510 14.02262 13.60511 13.51803
Mixture2 1 Mixture2_1 P43490 [K].nAQLNIELEAAHH.[-] 99.8437 3 N-Term(TMT6plex) 497023.17 15 0.00061 TRUE TRUE FALSE FALSE 15.58965 15.43193 15.86127 16.18651 15.86677 16.28384 15.93733 16.03418
dat.norm.w <- emptyList(variant.names)

2.1 median sweeping (1)

Median sweeping means subtracting from each PSM quantification value the spectrum median (i.e., the row median computed across samples/channels) and the sample median (i.e., the column median computed across features). If the unit scale is set to intensities or ratios, the multiplicative variant of this procedure is applied: subtraction is replaced by division. First, let’s sweep the medians of all the rows, and do the columns later, as suggested by [Herbrich et al.](https://doi.org/10.1021/pr300624g). There is no need to split this per Run, because each row in this semi-wide format contains only values from one Run, and each median calculation is independent of the other rows.

# subtract the spectrum median log2intensity from the observed log2intensities
dat.norm.w$median_sweeping <- dat.unit.w
dat.norm.w$median_sweeping[,channelNames] <- median_sweep(dat.norm.w$median_sweeping[,channelNames], 1, '-') 
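The `median_sweep` helper lives in `util/other_functions.R` and its implementation is not shown here; as a hypothetical sketch (not the project's actual code), a margin-wise subtractive sweep could look like this:

```r
# Hypothetical sketch of a median sweep: subtract the median along the given
# margin (1 = rows, 2 = columns), ignoring NAs. Not the project's own helper.
median_sweep_sketch <- function(mat, margin, fun = '-') {
  mat <- as.matrix(mat)
  meds <- apply(mat, margin, median, na.rm = TRUE)
  sweep(mat, margin, meds, FUN = fun)
}

m <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)
median_sweep_sketch(m, 1, '-')  # each row now has median 0
```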

2.2 CONSTANd

CONSTANd (Van Houtven et al.) normalizes a data matrix by ‘raking’ iteratively along the rows and columns (i.e., multiplying each row or column by a particular number) such that the row means and column means equal 1. The row and column constraints can never be attained exactly at the same time, but the algorithm converges very quickly to any desired precision.

# CONSTANd operates on matrix-like data, so use the semi-wide format
x.split <- split(dat.unit.w, dat.unit.w$Run)  # apply CONSTANd to each Run separately
x.split.norm  <- lapply(x.split, function(y) {
  y[,channelNames] <- CONSTANd(y[,channelNames])$normalized_data
  return(y)
})
dat.norm.w$CONSTANd <- bind_rows(x.split.norm)
display_dataframe_head(dat.norm.w$CONSTANd[, channelNames])
127C 127N 128C 128N 129C 129N 130C 130N
1.0070564 1.0119939 0.9735178 0.9986356 1.0126119 1.0100713 0.9901744 0.9959388
0.9997616 0.9985738 0.9998317 1.0070210 0.9873465 0.9984853 0.9987635 1.0102166
0.9940606 1.0048315 1.0014354 0.9933211 1.0012260 1.0077643 0.9955799 1.0017812
1.0069406 1.0001663 0.9893311 0.9970276 0.9978282 0.9993591 0.9989784 1.0103687
1.0025539 1.0100617 0.9989662 0.9943918 0.9940122 0.9814986 1.0073117 1.0112039
0.9918683 0.9987713 1.0031477 0.9980860 0.9925122 1.0000085 1.0056106 1.0099954
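The raking idea itself can be sketched in a few lines (a toy illustration, not the CONSTANd package implementation):

```r
# Toy raking sketch (not the CONSTANd package itself): alternately rescale
# rows and columns until all row means and column means are (nearly) 1.
rake_to_unit_means <- function(mat, max_iter = 100, tol = 1e-12) {
  for (i in seq_len(max_iter)) {
    mat <- mat / rowMeans(mat)                 # force row means to 1
    mat <- sweep(mat, 2, colMeans(mat), '/')   # force column means to 1
    if (max(abs(rowMeans(mat) - 1)) < tol) break
  }
  mat
}

set.seed(42)
m <- matrix(runif(24, min = 1, max = 100), nrow = 4)
m.norm <- rake_to_unit_means(m)  # row and column means all converge to ~1
```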

2.3 NOMAD

NOMAD (Murie et al.) is a computationally efficient implementation akin to linear regression, similar to Tukey’s median polish but using the mean: it subtracts the mean from each subset (defined by a categorical covariate) of quantification values, sequentially for each covariate. We apply NOMAD on the PSM level instead of the peptide level.

# doRobust=F: use means, like CONSTANd; doLog=F: values are already transformed.
dat.nomadnorm <- nomadNormalization(dat.unit.l$response, dat.unit.l %>% rename(iTRAQ=Channel) %>% as.data.frame, doRobust = FALSE, doiTRAQCorrection = FALSE, doLog = FALSE)
## Running normalization with  464224  number of data points
## Normalizing for factor:  Peptide 
## Normalizing for factor:  Run 
## Normalizing for factor:  iTRAQ 
## Normalizing for factor:  Run Peptide 
## Normalizing for factor:  Run iTRAQ
dat.nomadnorm$x$response <- dat.nomadnorm$y
dat.norm.w$NOMAD <- pivot_wider(data = dat.nomadnorm$x, id_cols=-one_of(c('Condition', 'BioReplicate')), names_from=iTRAQ, values_from=response)
display_dataframe_head(dat.norm.w$NOMAD[, channelNames])
127C 127N 128C 128N 129C 129N 130C 130N
0.0815099 -0.0090445 -0.1344888 -0.0490965 -0.0022344 -0.0098975 0.0897971 0.0334546
0.1412433 0.0369894 -0.0076766 -0.0603354 -0.0042326 -0.1789694 0.0374800 0.0355011
0.6809264 -0.1899760 -0.5749248 0.1466196 0.0667224 0.5624151 -0.1653113 -0.5264712
-0.1083032 0.0033191 -0.0268297 0.2366714 0.0127382 0.0379114 -0.0447572 -0.1107501
0.1969322 -0.0620115 -0.0675504 -0.0306049 0.0070092 0.1328236 -0.0268226 -0.1497756
-0.0817934 -0.1282747 0.1291359 0.0194186 -0.0345609 0.0708003 -0.0178522 0.0431265
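The sequential mean-centering idea can be sketched as follows (a toy illustration based on the description above, not the NOMAD package code; the column names are made up):

```r
# Sketch of the idea behind NOMAD (not the package implementation): for each
# categorical covariate in turn, subtract the group means of the residuals.
center_by_factors <- function(y, covars) {
  for (f in covars) {  # iterate over the covariate columns
    y <- y - ave(y, f, FUN = function(v) mean(v, na.rm = TRUE))
  }
  y
}

set.seed(1)
covars <- data.frame(Run = rep(c('r1', 'r2'), each = 4),
                     Channel = rep(c('c1', 'c2'), times = 4))
resid <- center_by_factors(rnorm(8, mean = 10), covars)
# in this balanced design, both Run and Channel group means end up ~0
```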

2.4 Quantile (1)

Quantile normalization (as implemented by [Bolstad et al.](https://doi.org/10.1093/bioinformatics/19.2.185)) makes the distribution (i.e., the values of the quantiles) of quantification values in different samples identical. We first apply it to each Run separately, and then re-scale the observations so that the mean observation within each run equals the mean observation across all runs. After summarization, we do a second pass on the matrix with data from all runs.

grand_average <- mean(as.matrix(dat.unit.w[,channelNames]))
x.split <- split(dat.unit.w, dat.unit.w$Run)  
x.split.norm <- lapply(x.split, function(y) {
  # apply normalizeQuantiles to each Run separately
  y[,channelNames] <- normalize.quantiles(as.matrix(y[,channelNames]))
  # make averages of all runs equal.
  y[,channelNames] <- y[,channelNames] / mean(colMeans(y[,channelNames])) * grand_average
  return(y)
})
dat.norm.w$quantile1 <- bind_rows(x.split.norm)
display_dataframe_head(dat.norm.w$quantile1[, channelNames])
127C 127N 128C 128N 129C 129N 130C 130N
15.03970 15.10688 14.53310 14.91507 15.13058 15.09261 14.79507 14.87922
16.11481 16.10985 16.17243 16.29760 15.96187 16.15730 16.15227 16.27854
15.07107 15.22315 15.18371 15.06272 15.18983 15.28967 15.09679 15.18206
18.23926 18.14132 18.08989 18.22935 18.20234 18.25349 18.24169 18.28337
15.79432 15.92202 15.78049 15.72100 15.69636 15.51384 15.91003 15.92120
15.10098 15.19336 15.27591 15.20582 15.11785 15.23672 15.31646 15.35316
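For intuition, the core of quantile normalization can be sketched in a few lines (a naive illustration, not the preprocessCore implementation, which handles ties and missing values more carefully):

```r
# Sketch of quantile normalization: each column is mapped onto the mean of
# the column-wise sorted values, so all columns share one distribution.
quantile_normalize_sketch <- function(mat) {
  ranks <- apply(mat, 2, rank, ties.method = 'first')
  ref <- rowMeans(apply(mat, 2, sort))   # reference distribution
  apply(ranks, 2, function(r) ref[r])
}

m <- matrix(c(5, 2, 3,
              4, 1, 6,
              7, 8, 9), nrow = 3)
m.qn <- quantile_normalize_sketch(m)  # every column now has identical sorted values
```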

3 Summarization component: Median summarization

Within each Run and within each Channel, we replace multiple related observations with their median. First, for each Peptide (median of the PSM values), then for each Protein (median of the peptide values).

# normalized data
# normalized data
dat.norm.summ.w <- lapply(dat.norm.w, function(x){
  # summarise_at drops the last grouping variable, so the first call collapses
  # PSMs to peptides (per Run, Protein, Peptide) and the second collapses
  # peptides to proteins (per Run, Protein).
  y <- x %>% group_by(Run, Protein, Peptide) %>% summarise_at(.vars = channelNames, .funs = median, na.rm=TRUE) %>% summarise_at(.vars = channelNames, .funs = median, na.rm=TRUE) %>% ungroup()
  return(y)})

Let’s also summarize the non-normalized data for comparison later on.

# non-normalized data
# group by (run,)protein,peptide then summarize twice (once on each level)
# add select() statement because summarise_at is going bananas over character columns
dat.nonnorm.summ.w <- dat.unit.w %>% group_by(Run, Protein, Peptide) %>% select(Run, Protein, Peptide, all_of(channelNames)) %>% summarise_at(.vars = channelNames, .funs = median) %>% select(Run, Protein, all_of(channelNames)) %>% summarise_at(.vars = channelNames, .funs = median) %>% ungroup()
# make data completely wide (also across runs)
## normalized data
dat.norm.summ.w2 <- lapply(dat.norm.summ.w, function(x) {
  return(x %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}"))
})

4 Normalization component (2)

Some normalization methods require a second stage after summarization; the quantile2 variant, by contrast, is applied only to the protein-level data.

4.1 Median sweeping (2)

Now that the data is on the protein level, let’s sweep all values separately per protein in the columns/samples. This is slightly different from sweeping before the summarization step because the median of medians is not the same as the grand median, but this does not introduce any bias.
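A tiny numeric example (with made-up values) of why the median of medians can differ from the grand median:

```r
# Two groups of values: the median of the per-group medians differs from
# the median of the pooled values.
groups <- list(c(1, 2, 3), c(4, 5, 6, 7, 8))
median(sapply(groups, median))  # median of medians: median(c(2, 6)) = 4
median(unlist(groups))          # grand median:      median(1:8)     = 4.5
```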

# median sweeping: in each channel, subtract median computed across all proteins within the channel, separately for each MS run.
tmp <- dat.norm.summ.w2$median_sweeping %>% select(-Protein)
dat.norm.summ.w2$median_sweeping[, colnames(dat.norm.summ.w2$median_sweeping)!='Protein'] <- median_sweep(tmp, 2, '-') 
# propagate the second round of median sweeping normalization from dat.norm.summ.w2 to dat.norm.summ.w
dat.norm.summ.w$median_sweeping <- dat.norm.summ.w2$median_sweeping %>% pivot_longer(cols=-one_of('Protein'), names_to='Sample', values_to='response', values_drop_na = T) %>% inner_join(sample.info[,c('Sample', 'Run', 'Channel', 'Condition')], by='Sample') %>% pivot_wider(id_cols=-one_of('Condition', 'Sample'), names_from=Channel, values_from=response)

4.2 Quantile (2)

Quantiles have been equalized within-run, but not yet across runs, so let’s do that now.

# apply normalizeQuantiles again, now to the data from all runs simultaneously
dat.norm.summ.w2$quantile1[,sample.info$Sample] <- normalize.quantiles(as.matrix(dat.norm.summ.w2$quantile1[,sample.info$Sample]))
# propagate the second round of quantile1 normalization from dat.norm.summ.w2 to dat.norm.summ.w
dat.norm.summ.w$quantile1 <- dat.norm.summ.w2$quantile1 %>% pivot_longer(cols=-one_of('Protein'), names_to='Sample', values_to='response', values_drop_na = T) %>% inner_join(sample.info[,c('Sample', 'Run', 'Channel', 'Condition')], by='Sample') %>% pivot_wider(id_cols=-one_of('Condition', 'Sample'), names_from=Channel, values_from=response)

4.3 Quantile norm of protein-level data by condition

As an alternative, we also perform quantile normalization on protein-level data only.

dat.nonnorm.summ.l <- to_long_format(dat.nonnorm.summ.w, sample.info)
x.split <- split(dat.nonnorm.summ.l, dat.nonnorm.summ.l$Condition)
x.split.norm <- lapply(x.split, function(x){
  tmp.wide <- pivot_wider(data=x, id_cols='Protein', names_from=Sample, values_from=response) 
  tmp.wide[, colnames(tmp.wide)!='Protein'] <- normalize.quantiles(as.matrix(tmp.wide %>% select(-Protein)))
  tmp <- pivot_longer(tmp.wide, cols=-one_of('Protein'), names_to='Sample', values_to='response') %>% drop_na() %>% inner_join(sample.info[,c('Sample', 'Run', 'Channel', 'Condition')], by='Sample')
  return(tmp)
})
x <- bind_rows(x.split.norm)
dat.norm.summ.w$quantile2 <- pivot_wider(data = x, id_cols=-one_of('Condition', 'Sample'), names_from=Channel, values_from=response)
dat.norm.summ.w2$quantile2 <- dat.norm.summ.w$quantile2 %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}")
variant.names <- c('median_sweeping', 'CONSTANd', 'NOMAD', 'quantile1', 'quantile2')
n.comp.variants <- length(variant.names)

5 QC plots

Before getting to the DEA section, let’s do some basic quality control and take a sneak peek at the differences between the component variants we’ve chosen. First, however, we should make the data completely wide, so that each sample gets its own unique column.

# make data completely wide (also across runs)
## non-normalized data
dat.nonnorm.summ.w2 <- dat.nonnorm.summ.w %>% pivot_wider(names_from = Run, values_from = all_of(channelNames), names_glue = "{Run}:{.value}")

5.1 Boxplots

These boxplots of both the raw and normalized intensities show that the distributions are all symmetrical. The shapes of the median sweeping, CONSTANd and NOMAD distributions are very similar, as are those of the two quantile approaches, which also retain an absolute magnitude.

# use (half-)wide format
par(mfrow=c(3,2))
boxplot_w(dat.nonnorm.summ.w,sample.info, 'raw')
for (i in 1:n.comp.variants){
  boxplot_w(dat.norm.summ.w[[variant.names[i]]], sample.info, paste('normalized', variant.names[i], sep='_'))}

5.2 MA plots

We then make MA plots of two single samples taken from condition 0.5 and condition 0.125, measured in different MS runs (samples Mixture2_1:127C and Mixture1_2:129N, respectively). Clearly, the normalization had a strong variance-reducing effect on the fold changes. The fold changes after CONSTANd normalization seem disproportionately small because, as we will see later, CONSTANd is only suitable for use with untransformed intensities. The two plots of the quantile approaches are near-indistinguishable.

# use wide2 format
p <- vector('list', n.comp.variants+1)
p[[1]] <- maplot_ils(dat.nonnorm.summ.w2, ma.onesample.num, ma.onesample.denom, scale.vec, 'raw', spiked.proteins)
for (i in 1: n.comp.variants){
 p[[i+1]]<- maplot_ils(dat.norm.summ.w2[[variant.names[i]]], ma.onesample.num, ma.onesample.denom, scale.vec, paste('normalized', variant.names[i], sep='_'), spiked.proteins)}
grid.arrange(grobs = p, ncol=2, nrow=3)

To increase the robustness of these results, let’s make some more MA plots, but now for all samples from condition 0.5 and condition 0.125 (quantification values averaged within condition). Indeed, even the raw, unnormalized data now show less variability. The two plots of quantile approaches are still near-indistinguishable.

channels.num <- sample.info %>% filter(Condition==ma.allsamples.num) %>% select(Sample) %>% pull
channels.denom <- sample.info %>% filter(Condition==ma.allsamples.denom) %>% select(Sample) %>% pull
p <- vector('list', n.comp.variants+1)
p[[1]] <- maplot_ils(dat.nonnorm.summ.w2, channels.num, channels.denom, scale.vec, 'raw', spiked.proteins)
for (i in 1: n.comp.variants){
 p[[i+1]]<- maplot_ils(dat.norm.summ.w2[[variant.names[i]]], channels.num, channels.denom, scale.vec, paste('normalized', variant.names[i], sep='_'), spiked.proteins)}
grid.arrange(grobs = p, ncol=2)

5.3 PCA plots

Now, let’s check whether these multi-dimensional data contain some kind of grouping; it’s time to make PCA plots.

5.3.1 Using all proteins

Even though PC1 does seem to capture the conditions, providing a gradient along the dilution factor, only the 0.125 condition is completely separable in the normalized data. Both quantile approaches are clearly insufficient here, as they only barely change the variance structure. Meanwhile, median sweeping, CONSTANd and NOMAD produce very similar PCA plots (CONSTANd’s PC1 direction is just inverted).

par(mfrow=c(3,2))
pcaplot_ils(dat.nonnorm.summ.w2 %>% select(-'Protein'), info=sample.info, 'raw')
for (i in 1:n.comp.variants){
pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}

There are only 19 proteins supposed to be differentially expressed in this data set, which is a very small number in both relative (compared to the 4083 proteins in total) and absolute (for a biological sample) terms.

5.3.2 Using spiked proteins only

Therefore, let’s see what the PCA plots look like if we were to only use the spiked proteins in the PCA. After normalization, the non-quantile variants produce similar plots where only conditions 0.5 and 0.667 aren’t clearly separable. The quantile approaches again barely change the variance structure of the raw data.

par(mfrow=c(3,2))
pcaplot_ils(dat.nonnorm.summ.w2 %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, 'raw')
for (i in 1:n.comp.variants){
  pcaplot_ils(dat.norm.summ.w2[[variant.names[i]]] %>% filter(Protein %in% spiked.proteins) %>% select(-'Protein'), info=sample.info, paste('normalized', variant.names[i], sep='_'))}

Notice how for all PCA plots, the percentage of variance explained by PC1 is now much greater than when using data from all proteins. In a real situation without spiked proteins, you might plot data corresponding to the top X most differential proteins instead.

5.4 HC (hierarchical clustering) plots

The PCA plots of all proteins have a rather low fraction of variance explained by PC1. We can confirm this using the hierarchical clustering dendrograms below: when considering the entire multidimensional space, the different conditions are not very separable at all (except for condition 0.125 after the non-quantile normalization approaches). This is not surprising, as there is little biological variation between the conditions: there are only 19 truly differential proteins, and they all (ought to) covary in exactly the same manner (i.e., their variation can be captured in one dimension).

par(mfrow=c(3,2))
dendrogram_ils(dat.nonnorm.summ.w2 %>% select(-Protein), info=sample.info, 'raw')
for (i in 1:n.comp.variants){
  dendrogram_ils(dat.norm.summ.w2[[variant.names[i]]] %>% select(-Protein), info=sample.info, paste('normalized', variant.names[i], sep='_'))}

5.5 Run effect p-value plot

Our last quality check involves a measure of how well each variant was able to assist in removing the run effect. Below are the distributions of p-values from a linear model for the response variable with Run as a covariate. If the run effect was removed successfully, these p-values ought to be large.

Clearly, the raw data contain a run effect, which is partially removed by median sweeping and as good as completely removed by CONSTANd and NOMAD. Neither quantile approach removes the run effect (their curves overlap with the raw data curve).
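As an illustration of how such per-protein p-values could be computed (a hypothetical sketch; `run_effect_plot` in `util/plotting_functions.R` does the real work):

```r
# Sketch of a per-protein run-effect test: fit response ~ Run for each
# protein and keep the F-test p-value for the Run term.
run_effect_pvals <- function(dat.long) {
  sapply(split(dat.long, dat.long$Protein), function(d) {
    anova(lm(response ~ Run, data = d))['Run', 'Pr(>F)']
  })
}

set.seed(7)
toy <- data.frame(Protein = rep(c('P1', 'P2'), each = 20),
                  Run = rep(rep(c('run1', 'run2'), each = 10), 2),
                  response = c(rnorm(10), rnorm(10, mean = 5), rnorm(20)))
pvals <- run_effect_pvals(toy)  # P1 carries a strong run effect, P2 none
```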

dat <- vector('list',length(dat.norm.summ.l)+1)
dat[[1]] <- dat.nonnorm.summ.l; dat[2:length(dat)] <- dat.norm.summ.l[1:length(dat.norm.summ.l)]
names(dat) <- c('raw', names(dat.norm.summ.l))
run_effect_plot(dat)

6 DEA component: Moderated t-test

We look at the log2 fold changes of each condition w.r.t. the reference condition with dilution ratio 0.5. Since we are working on a log2 unit scale already, this means that for each protein we just look at the difference in mean observation across all channels between one condition and the reference condition. Note that this is not the same as the log2 of the ratio of mean raw intensities for each condition (left hand side below), nor the log2 of the mean ratio of raw intensities for each condition (right hand side below), since \(log_2 \left(\frac{mean(B)}{mean(A)}\right) \neq mean(log_2 (B)) - mean(log_2 (A)) \neq log_2 \left(mean\left(\frac{B}{A}\right)\right)\).
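A quick numeric illustration with made-up intensities shows that these three quantities indeed differ:

```r
# Made-up raw intensities for one protein in two conditions.
A <- c(2, 8)    # reference condition
B <- c(4, 64)
log2(mean(B) / mean(A))          # log2 of ratio of means: log2(6.8) ~ 2.77
mean(log2(B)) - mean(log2(A))    # what we compute: (2 + 6)/2 - (1 + 3)/2 = 2
log2(mean(B / A))                # log2 of mean ratio: log2(5) ~ 2.32
```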

To check whether these fold changes are significant (criterion: \(q<0.05\)), we use a moderated t-test slightly adapted from the limma package, which in use cases like ours should improve statistical power over a regular t-test. In a nutshell, this is a t-test done independently for each protein, although the variance used in the calculation of the t-statistic is moderated using empirical Bayes estimation. After testing, we correct for multiple testing using the Benjamini-Hochberg method in order to keep the FDR under control.
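The Benjamini-Hochberg step can be reproduced in base R with `p.adjust`; a small sketch with made-up p-values (the moderated test itself lives in our adapted limma code):

```r
# BH adjustment in base R: q-values are BH-adjusted p-values, so thresholding
# them at 0.05 keeps the expected FDR below 5%.
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
q <- p.adjust(p, method = 'BH')
sum(q < 0.05)  # number of proteins called DE at a 5% FDR
```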

# design matrix as used in ANOVA testing.
design.matrix <- get_design_matrix(referenceCondition, sample.info)
dat.dea <- emptyList(names(dat.norm.summ.w2))
for (i in 1:n.comp.variants) {
  this_scale <- scale.vec
  d <- column_to_rownames(as.data.frame(dat.norm.summ.w2[[variant.names[i]]]), 'Protein')
  dat.dea[[variant.names[i]]] <- moderated_ttest(dat=d, design.matrix, scale=this_scale)}
# also see what the unnormalized results would look like
n.comp.variants <- n.comp.variants + 1
variant.names <- c(variant.names, 'raw')
dat.dea$raw <- moderated_ttest(dat=column_to_rownames(dat.nonnorm.summ.w2, 'Protein'), design.matrix, scale='log')

For each condition, we now get the fold changes, moderated and unmoderated p-values, moderated and unmoderated q-values (BH-adjusted p-values), and some other details (head of dataframe below).

display_dataframe_head(dat.dea[[1]])
logFC_0.125 logFC_0.667 logFC_1 t.ord_0.125 t.ord_0.667 t.ord_1 t.mod_0.125 t.mod_0.667 t.mod_1 p.ord_0.125 p.ord_0.667 p.ord_1 p.mod_0.125 p.mod_0.667 p.mod_1 q.ord_0.125 q.ord_0.667 q.ord_1 q.mod_0.125 q.mod_0.667 q.mod_1 df.r df.0 s2.0 s2 s2.post
A0AVT1 -0.0296332 -0.0190075 0.0086228 -1.2236300 -0.7848676 0.3560550 -1.1698892 -0.7503969 0.3404173 0.2312896 0.4391218 0.7244682 0.2512504 0.4588575 0.7359131 0.9569613 0.9960123 0.9898622 0.9350748 0.9805978 0.9840728 28 2.018963 0.0056242 0.0023459 0.0025664
A0FGR8 0.0121287 0.0389655 0.0247813 0.3627376 1.1653597 0.7411452 0.3596350 1.1553923 0.7348061 0.7195248 0.2537054 0.4647764 0.7216378 0.2570437 0.4681594 0.9841943 0.9960123 0.9820911 0.9795638 0.9723943 0.9650451 28 2.018963 0.0056242 0.0044720 0.0045495
A0MZ66 0.0035130 -0.0009938 -0.0183834 0.0761086 -0.0215311 -0.3982773 0.0769941 -0.0217816 -0.4029113 0.9398739 0.9829747 0.6934465 0.9391391 0.9827663 0.6898687 0.9978394 0.9975071 0.9898622 0.9964971 0.9975410 0.9825206 28 2.018963 0.0056242 0.0085220 0.0083271
A1L0T0 0.0520381 -0.0328378 -0.0500166 0.7951963 -0.5017969 -0.7643056 0.8165216 -0.5152539 -0.7848025 0.4358354 0.6212896 0.4536089 0.4229483 0.6115101 0.4409319 0.9569613 0.9975071 0.9820911 0.9350748 0.9924981 0.9650451 20 2.018963 0.0056242 0.0128474 0.0121851
A3KMH1 0.4342224 0.3206653 0.5484563 1.7510253 1.2931002 2.2116798 1.8344616 1.3547163 2.3170662 0.0952654 0.2107143 0.0387933 0.0801348 0.1892462 0.0301887 0.9569613 0.9960123 0.7960743 0.9350748 0.9540539 0.7680706 20 2.018963 0.0056242 0.1844849 0.1680848
A4D1E9 -0.0219254 -0.0731552 -0.0426655 -0.4771987 -1.5921945 -0.9285977 -0.4796669 -1.6004298 -0.9334006 0.6383938 0.1270236 0.3641701 0.6361944 0.1237544 0.3607366 0.9741019 0.9841938 0.9820911 0.9644451 0.9260254 0.9650451 20 2.018963 0.0056242 0.0063331 0.0062681

7 Results comparison

Now, the most important part: let’s find out how our component variants have affected the outcome of the DEA.

7.1 Confusion matrix

A confusion matrix shows how many true and false positives/negatives each variant has given rise to. Spiked proteins that are DE are true positives, background proteins that are not DE are true negatives. We calculate this matrix for all conditions and then calculate some other informative metrics based on the confusion matrices: accuracy, sensitivity, specificity, positive predictive value and negative predictive value.

Clearly, the quantile normalization approaches underperform across the board, while median sweeping, CONSTANd and NOMAD are on equal footing, with an acceptable but not spectacular sensitivity. That said, the contrast between conditions 0.667 and 0.5 does not seem large enough to yield many significant results.
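For reference, here is how the reported metrics follow from the confusion matrix counts, using the median_sweeping counts of the 0.125 vs 0.5 contrast below as a worked example:

```r
# TP = spiked & DE, FP = background & DE, TN = background & not_DE,
# FN = spiked & not_DE (median_sweeping, 0.125 vs 0.5 contrast).
TP <- 15; FP <- 3; TN <- 4061; FN <- 4
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # fraction of spiked proteins detected
specificity <- TN / (TN + FP)
ppv         <- TP / (TP + FP)   # positive predictive value
npv         <- TN / (TN + FN)   # negative predictive value
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, PPV = ppv, NPV = npv), 3)
```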

cm <- conf_mat(dat.dea, 'q.mod', 0.05, spiked.proteins)
print_conf_mat(cm, referenceCondition)
0.125 vs 0.5 contrast
median_sweeping
CONSTANd
NOMAD
quantile1
quantile2
raw
background spiked background spiked background spiked background spiked background spiked background spiked
not_DE 4061 4 4060 4 4062 4 4064 15 4064 15 4064 19
DE 3 15 4 15 2 15 0 4 0 4 0 0
0.125 vs 0.5 contrast
median_sweeping CONSTANd NOMAD quantile1 quantile2 raw
Accuracy 0.998 0.998 0.999 0.996 0.996 0.995
Sensitivity 0.789 0.789 0.789 0.211 0.211 0.000
Specificity 0.999 0.999 1.000 1.000 1.000 1.000
PPV 0.833 0.789 0.882 1.000 1.000 NaN
NPV 0.999 0.999 0.999 0.996 0.996 0.995
0.667 vs 0.5 contrast
median_sweeping
CONSTANd
NOMAD
quantile1
quantile2
raw
background spiked background spiked background spiked background spiked background spiked background spiked
not_DE 4064 18 4064 18 4064 17 4064 19 4064 19 4064 19
DE 0 1 0 1 0 2 0 0 0 0 0 0
0.667 vs 0.5 contrast
median_sweeping CONSTANd NOMAD quantile1 quantile2 raw
Accuracy 0.996 0.996 0.996 0.995 0.995 0.995
Sensitivity 0.053 0.053 0.105 0.000 0.000 0.000
Specificity 1.000 1.000 1.000 1.000 1.000 1.000
PPV 1.000 1.000 1.000 NaN NaN NaN
NPV 0.996 0.996 0.996 0.995 0.995 0.995
1 vs 0.5 contrast
median_sweeping
CONSTANd
NOMAD
quantile1
quantile2
raw
background spiked background spiked background spiked background spiked background spiked background spiked
not_DE 4060 5 4062 5 4062 5 4064 16 4064 15 4064 19
DE 4 14 2 14 2 14 0 3 0 4 0 0
1 vs 0.5 contrast
median_sweeping CONSTANd NOMAD quantile1 quantile2 raw
Accuracy 0.998 0.998 0.998 0.996 0.996 0.995
Sensitivity 0.737 0.737 0.737 0.158 0.211 0.000
Specificity 0.999 1.000 1.000 1.000 1.000 1.000
PPV 0.778 0.875 0.875 1.000 1.000 NaN
NPV 0.999 0.999 0.999 0.996 0.996 0.995

7.2 Correlation scatter plots

To see whether the normalization methods produce similar results on the detailed level of individual proteins, we make scatter plots and check the correlation between their fold changes and between their significance estimates (q-values, in our case).

For all conditions, we see that the q-values of the non-quantile methods correlate well (\(>0.874\)) with each other, especially for the spike-in proteins with low q-values. The two quantile approaches have highly correlated q-values and highly correlated fold changes with respect to each other, but not with respect to the other methods.

scatterplot_ils(dat.dea, significance.cols, 'q-values', spiked.proteins, referenceCondition)

The fold change estimates of both the non-quantile and quantile methods correlate very well (\(>0.972\)), though the CONSTANd graph seems rotated: the ratios are disproportionately compressed. It is remarkable that even though the quantile methods seem to underperform across the board, they still produce reliable fold change estimates.

scatterplot_ils(dat.dea, logFC.cols, 'log2FC', spiked.proteins, referenceCondition)

7.3 Volcano plots

The volcano plot combines information on fold changes and statistical significance. The spike-in proteins are colored blue, and immediately it is clear that their fold changes dominate the region of statistical significance, which suggests the experiment and analysis were carried out successfully. The magenta, dashed line indicates the theoretical fold change of the spike-ins.

The quantile plots are extremely flat, suggesting the approach is not powerful enough. We can again see that the CONSTANd fold changes are disproportionately small in absolute value.

for (i in 1:n.contrasts) {
  volcanoplot_ils(dat.dea, i, spiked.proteins, referenceCondition)
}
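For reference, the essence of such a plot can be sketched with plain ggplot2. The data here are toy numbers, and `volcanoplot_ils` is the notebook's own helper, which may differ in detail; this only shows the underlying construction of \(-\log_{10}(q)\) against the log2 fold change:

```r
library(ggplot2)

set.seed(1)
# toy data: 200 background proteins around zero fold change,
# 10 spike-ins around the theoretical log2FC of 1
toy <- data.frame(
  logFC  = c(rnorm(200, mean = 0, sd = 0.2), rnorm(10, mean = 1, sd = 0.2)),
  qval   = c(runif(200, 0.05, 1), runif(10, 1e-6, 0.01)),
  spiked = rep(c(FALSE, TRUE), c(200, 10))
)
p <- ggplot(toy, aes(logFC, -log10(qval), colour = spiked)) +
  geom_point() +
  geom_vline(xintercept = 1, colour = "magenta", linetype = "dashed") +  # theoretical log2FC
  labs(y = "-log10(q-value)")
p
```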

7.4 Violin plots

A good way to assess the general trend of the fold change estimates on a more ‘macroscopic’ scale is to make a violin plot. Ideally, there will be some spike-in proteins that attain the expected fold change (red dashed line) that corresponds to their condition, while most (background) protein log2 fold changes are situated around zero.

Clearly, the empirical results tend towards the theoretical truth, but not a single observation attained its expected fold change. There is a strong bias towards zero fold change, which may partly be explained by the ratio compression phenomenon in mass spectrometry, although the effect seems quite extreme here. In general, all fold change distributions are quite similar, except for CONSTANd's, which is too narrow, and that of the raw data, which is noticeably more spread out.

# plot theoretical value (horizontal lines) and violin per variant
violinplot_ils(lapply(dat.dea, function(x) x[spiked.proteins, logFC.cols]), referenceCondition)
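In the same spirit, the layout of such a plot can be sketched with plain ggplot2: one fold change distribution per variant, compared against a horizontal line at the theoretical value. The numbers below are simulated for illustration only (with CONSTANd deliberately drawn narrower and closer to zero, mimicking the compression described above):

```r
library(ggplot2)

set.seed(1)
# toy spike-in log2 fold changes per variant (simulated, not the real data)
fc <- data.frame(
  variant = rep(c("median_sweeping", "CONSTANd"), each = 50),
  logFC   = c(rnorm(50, mean = 0.8, sd = 0.3),    # biased below the truth
              rnorm(50, mean = 0.4, sd = 0.15))   # CONSTANd: narrower, nearer zero
)
p <- ggplot(fc, aes(variant, logFC)) +
  geom_violin() +
  geom_hline(yintercept = 1, colour = "red", linetype = "dashed")  # theoretical log2FC
p
```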

8 Conclusions

For the given data set, the differences in proteomic outcomes between median sweeping and NOMAD normalization are quite small, both on the global and the individual scale. The quantile methods seem to underperform across the board, but they still produce reliable fold change estimates. Finally, CONSTANd naturally reduces the variance in the distribution of quantification values and is only suitable for use with untransformed intensities. When used on log2-transformed values, as we did here, there is a double variance-reducing effect that ends up over-compressing the fold change estimates. However, when applied to untransformed intensities, as in this bonus notebook, CONSTANd performs at least on par with median sweeping!

9 Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_BE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_BE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_BE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] NOMAD_0.99.0          dplR_1.7.1            CONSTANd_0.99.5      
##  [4] preprocessCore_1.51.0 forcats_0.5.0         stringr_1.4.0        
##  [7] dplyr_1.0.2           purrr_0.3.4           readr_1.4.0          
## [10] tidyr_1.1.2           tibble_3.0.4          ggplot2_3.3.2        
## [13] tidyverse_1.3.0       limma_3.45.19         psych_2.0.9          
## [16] kableExtra_1.3.1      dendextend_1.14.0     gridExtra_2.3        
## [19] stringi_1.5.3        
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-150         matrixStats_0.57.0   fs_1.5.0            
##  [4] lubridate_1.7.9      webshot_0.5.2        httr_1.4.2          
##  [7] tools_4.0.3          backports_1.1.10     R6_2.4.1            
## [10] rpart_4.1-15         DBI_1.1.0            mgcv_1.8-33         
## [13] colorspace_1.4-1     nnet_7.3-14          withr_2.3.0         
## [16] tidyselect_1.1.0     mnormt_2.0.2         compiler_4.0.3      
## [19] cli_2.1.0            rvest_0.3.6          xml2_1.3.2          
## [22] labeling_0.4.2       scales_1.1.1         digest_0.6.27       
## [25] rmarkdown_2.5        R.utils_2.10.1       pkgconfig_2.0.3     
## [28] htmltools_0.5.0      dbplyr_1.4.4         highr_0.8           
## [31] rlang_0.4.8          readxl_1.3.1         rstudioapi_0.11     
## [34] generics_0.0.2       farver_2.0.3         jsonlite_1.7.1      
## [37] ModelMetrics_1.2.2.2 R.oo_1.24.0          magrittr_1.5        
## [40] Matrix_1.2-18        Rcpp_1.0.5           munsell_0.5.0       
## [43] fansi_0.4.1          viridis_0.5.1        lifecycle_0.2.0     
## [46] R.methodsS3_1.8.1    pROC_1.16.2          yaml_2.2.1          
## [49] MASS_7.3-53          recipes_0.1.14       plyr_1.8.6          
## [52] grid_4.0.3           blob_1.2.1           parallel_4.0.3      
## [55] crayon_1.3.4         lattice_0.20-41      haven_2.3.1         
## [58] splines_4.0.3        hms_0.5.3            tmvnsim_1.0-2       
## [61] knitr_1.30           pillar_1.4.6         stats4_4.0.3        
## [64] reshape2_1.4.4       codetools_0.2-16     reprex_0.3.0        
## [67] XML_3.99-0.5         glue_1.4.2           evaluate_0.14       
## [70] data.table_1.13.2    modelr_0.1.8         foreach_1.5.1       
## [73] png_0.1-7            vctrs_0.3.4          cellranger_1.1.0    
## [76] gtable_0.3.0         assertthat_0.2.1     gower_0.2.2         
## [79] xfun_0.18            prodlim_2019.11.13   broom_0.7.2         
## [82] e1071_1.7-4          survival_3.2-7       class_7.3-17        
## [85] viridisLite_0.3.0    timeDate_3043.102    signal_0.7-6        
## [88] iterators_1.0.13     lava_1.6.8           ellipsis_0.3.1      
## [91] caret_6.0-86         ipred_0.9-9